About this Notebook

This notebook demonstrates using ML Workbench to create a machine learning model for text classification and to set it up for online prediction. Training the model is done "locally" inside Datalab. The next notebook (Text Classification --- 20NewsGroup (large data)) demonstrates how to do the same using Cloud ML Engine services.

If you have any feedback, please send it to datalab-feedback@google.com.

Data

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics. The classification problem is to identify the newsgroup a post was submitted to, given the text of the post.

There are a few versions of this dataset available online from different sources. Below, we use the version bundled with scikit-learn, which is already split into train and test/eval sets. For a longer introduction to this dataset, see the scikit-learn website.

Download Data


In [59]:
import numpy as np
import pandas as pd
import os
import re
import csv
from sklearn.datasets import fetch_20newsgroups



In [60]:
# The data will be downloaded. Note that a message like "No handlers could be found for
# logger sklearn.datasets.twenty_newsgroups" might be printed; it is a logging artifact, not an error.
news_train_data = fetch_20newsgroups(subset='train', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))
news_test_data = fetch_20newsgroups(subset='test', shuffle=True, random_state=42, remove=('headers', 'footers', 'quotes'))


Cleaning the Raw Data

Printing the 3rd element of the training dataset shows that the data contains text with newlines, punctuation, misspellings, and other issues common in text documents. To build a model, we will clean up the text by removing some of these issues.


In [61]:
news_train_data.data[2], news_train_data.target_names[news_train_data.target[2]]


Out[61]:
(u'well folks, my mac plus finally gave up the ghost this weekend after\nstarting life as a 512k way back in 1985.  sooo, i\'m in the market for a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking into picking up a powerbook 160 or maybe 180 and have a bunch\nof questions that (hopefully) somebody can answer:\n\n* does anybody know any dirt on when the next round of powerbook\nintroductions are expected?  i\'d heard the 185c was supposed to make an\nappearence "this summer" but haven\'t heard anymore on it - and since i\ndon\'t have access to macleak, i was wondering if anybody out there had\nmore info...\n\n* has anybody heard rumors about price drops to the powerbook line like the\nones the duo\'s just went through recently?\n\n* what\'s the impression of the display on the 180?  i could probably swing\na 180 if i got the 80Mb disk rather than the 120, but i don\'t really have\na feel for how much "better" the display is (yea, it looks great in the\nstore, but is that all "wow" or is it really that good?).  could i solicit\nsome opinions of people who use the 160 and 180 day-to-day on if its worth\ntaking the disk size and money hit to get the active display?  (i realize\nthis is a real subjective question, but i\'ve only played around with the\nmachines in a computer store breifly and figured the opinions of somebody\nwho actually uses the machine daily might prove helpful).\n\n* how well does hellcats perform?  ;)\n\nthanks a bunch in advance for any info - if you could email, i\'ll post a\nsummary (news reading time is at a premium with finals just around the\ncorner... :( )\n--\nTom Willis  \\  twillis@ecn.purdue.edu    \\    Purdue Electrical Engineering',
 'comp.sys.mac.hardware')

In [62]:
def clean_and_tokenize_text(news_data):
    """Cleans up some issues in the raw text data.
    Args:
        news_data: list of text strings
    Returns:
        A list containing, for each input string, a list of tokenized words.
    """
    cleaned_text = []
    for text in news_data:
        x = re.sub('[^\w]|_', ' ', text)  # replace everything except letters and digits with spaces
        x = x.lower()
        x = re.sub(r'[^\x00-\x7f]', r'', x)  # remove non-ascii characters
        tokens = [y for y in x.split(' ') if y]  # remove empty tokens
        tokens = ['[number]' if t.isdigit() else t for t in tokens]  # map all numbers to '[number]' to reduce vocab size
        cleaned_text.append(tokens)
    return cleaned_text



In [63]:
clean_train_tokens = clean_and_tokenize_text(news_train_data.data)
clean_test_tokens = clean_and_tokenize_text(news_test_data.data)


Get Vocabulary

We need to filter the vocabulary to remove high-frequency words (too common to be informative) and low-frequency words (too rare to learn from).


In [64]:
def get_unique_tokens_per_row(text_token_list):
    """Collect unique tokens per row.
    Args:
        text_token_list: list, where each element is a list containing tokenized text
    Returns:
        One list containing the unique tokens in every row. For example, if row one contained
        ['pizza', 'pizza'] while row two contained ['pizza', 'cake', 'cake'], then the output list
        would contain ['pizza' (from row 1), 'pizza' (from row 2), 'cake' (from row 2)]
    """
    words = []
    for row in text_token_list:
        words.extend(list(set(row)))
    return words



In [65]:
# Make a plot where the x-axis is a token, and the y-axis is how many text documents
# that token is in. 
words = pd.DataFrame(get_unique_tokens_per_row(clean_train_tokens), columns=['words'])
token_frequency = words['words'].value_counts() # how many documents contain each token.
token_frequency.plot(logy=True)


Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50fcbd81d0>

In [66]:
vocab = token_frequency[np.logical_and(token_frequency < 1000, token_frequency > 10)]  # keep tokens appearing in more than 10 but fewer than 1000 documents
vocab.plot(logy=True)


Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f5109c4efd0>

In [67]:
def filter_text_by_vocab(news_data, vocab):
    """Removes tokens if not in vocab.
    Args:
        news_data: list, where each element is a token list
        vocab: set containing the tokens to keep.
    Returns:
        List of strings containing the final cleaned text data
    """
    text_strs = []
    for row in news_data:
        words_to_keep = [token for token in row if token in vocab or token == '[number]']
        text_strs.append(' '.join(words_to_keep))
    return text_strs



In [68]:
clean_train_data = filter_text_by_vocab(clean_train_tokens, set(vocab.index))
clean_test_data = filter_text_by_vocab(clean_test_tokens, set(vocab.index))



In [69]:
# Check a few instances of cleaned data
clean_train_data[:3]


Out[69]:
[u'wondering enlighten car saw day [number] door sports car looked late 60s early 70s called doors small addition front bumper separate rest body model name engine specs years production car made history whatever info looking car e mail',
 u'fair number brave souls upgraded si clock oscillator shared experiences poll send brief message detailing experiences procedure top speed cpu rated speed add cards adapters heat hour usage per day floppy disk functionality [number] [number] [number] floppies especially requested next days add network knowledge base done clock upgrade haven answered poll',
 u'folks mac plus finally gave ghost weekend starting life 512k [number] market machine bit sooner intended looking picking powerbook [number] maybe [number] bunch questions hopefully somebody answer anybody dirt next round powerbook expected heard supposed summer haven heard anymore access wondering anybody info anybody heard rumors price drops powerbook line ones duo went through recently impression display [number] probably swing [number] got 80mb disk rather [number] feel better display yea looks great store wow opinions [number] [number] day day worth taking disk size money hit active display realize real subjective question played around machines computer store figured opinions somebody actually uses machine daily might prove helpful perform bunch advance info email ll post summary news reading premium finals around corner tom ecn purdue edu purdue electrical engineering']

Save the Cleaned Data For Training


In [70]:
!mkdir -p ./data

with open('./data/train.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_train_data.target, clean_train_data):
        writer.writerow([news_train_data.target_names[target], text])
        
with open('./data/eval.csv', 'w') as f:
    writer = csv.writer(f, lineterminator='\n')
    for target, text in zip(news_test_data.target, clean_test_data):
        writer.writerow([news_test_data.target_names[target], text]) 
        
# Also save the vocab, which will be useful in making new predictions.
with open('./data/vocab.txt', 'w') as f:
    vocab.to_csv(f)


Create Model with ML Workbench

The MLWorkbench Magics are a set of Datalab commands that provide an easy, code-free experience for training, deploying, and predicting with ML models. They cover each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model. This notebook takes the cleaned data from the previous steps and builds a text classification model.

For details of each command, run with --help. For example, "%%ml train --help".

When the dataset is small (like the 20 newsgroup data), there is little benefit in using cloud services. This notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).

The next notebook (Text Classification --- 20NewsGroup (large data)) in this sequence shows the cloud version of every command, which is the normal experience when building models on large datasets. However, it still uses the 20 newsgroup data.


In [71]:
import google.datalab.contrib.mlworkbench.commands  # This loads the '%%ml' magics


First, define the dataset we are going to use for training.


In [72]:
%%ml dataset create
name: newsgroup_data
format: csv
train: ./data/train.csv
eval: ./data/eval.csv
schema:
    - name: news_label
      type: STRING
    - name: text
      type: STRING



In [73]:
%%ml dataset explore
name: newsgroup_data


train data instances: 11314
eval data instances: 7532

Step 1: Analyze

The first step in the MLWorkbench workflow is to analyze the data for the requested transformations. We are going to build a bag-of-words representation of the text and use it in a linear model, so the analyze step computes the vocabularies and related statistics of the data for training.
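
As a quick illustration, a bag-of-words representation reduces each document to token counts over a fixed vocabulary. The sketch below shows the idea only; the vocabulary mapping is hypothetical, and the analyze step builds the real one from the training data.

from collections import Counter

# Minimal bag-of-words sketch: each document becomes {token id: count}.
# The vocab mapping here is made up for illustration; the analyze step
# derives the real vocabulary from the training data.
doc = 'nasa launch nasa shuttle'
vocab_ids = {'nasa': 0, 'launch': 1, 'shuttle': 2}
counts = Counter(doc.split())
bag = {vocab_ids[token]: n for token, n in counts.items()}
print(bag)  # {0: 2, 1: 1, 2: 1}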


In [74]:
%%ml analyze
output: ./analysis
data: newsgroup_data
features:
    news_label:
        transform: target
    text:
        transform: bag_of_words


Expanding any file patterns...
file list computed.
Analyzing file /content/datalab/docs/samples/contrib/mlworkbench/text_classification_20newsgroup/data/train.csv...
file /content/datalab/docs/samples/contrib/mlworkbench/text_classification_20newsgroup/data/train.csv analyzed.

In [75]:
!ls ./analysis


features.json  schema.json  stats.json	vocab_news_label.csv  vocab_text.csv

Step 2: Transform

This step is optional, as training can start from csv data (the same data used in the analysis step). The transform step performs some transformations on the input data and saves the results to a special TensorFlow file called a TFRecord file, which contains TF.Example protocol buffers. This allows training to start from preprocessed data. If this step is skipped, training has to perform the same preprocessing on every row of csv data each time it is read. Because TensorFlow reads the same data rows multiple times during training, the same row would be preprocessed multiple times; writing the preprocessed data to disk therefore speeds up training. Because the 20 newsgroups data is small, this step does not matter much, but we do it anyway for illustration. This step is recommended if a dataset has text columns, and required if it has image columns.

We run the transform step for the training and eval data.


In [76]:
!rm -rf ./transform



In [77]:
%%ml transform --shuffle
output: ./transform
analysis: ./analysis
data: newsgroup_data


/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
WARNING:root:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
/usr/local/lib/python2.7/dist-packages/apache_beam/coders/typecoders.py:135: UserWarning: Using fallback coder for typehint: Any.
  warnings.warn('Using fallback coder for typehint: %r.' % typehint)
WARNING:root:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.

In [78]:
# Note: the errors_* files are all zero-sized, which means there were no errors.
!ls ./transform/ -l -h


total 3.1M
-rw-r--r-- 1 root root    0 Oct 19 21:13 errors_eval-00000-of-00001.txt
-rw-r--r-- 1 root root    0 Oct 19 21:13 errors_train-00000-of-00001.txt
-rw-r--r-- 1 root root 1.2M Oct 19 21:13 eval-00000-of-00001.tfrecord.gz
-rw-r--r-- 1 root root 2.0M Oct 19 21:13 train-00000-of-00001.tfrecord.gz
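
As a sanity check, we could count the tf.Example records in the compressed TFRecord files (a sketch using the TF 1.x python_io API available in this environment; the file names come from the listing above):

import tensorflow as tf

# Count records in the gzipped TFRecord files written by the transform step.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
n_train = sum(1 for _ in tf.python_io.tf_record_iterator(
    './transform/train-00000-of-00001.tfrecord.gz', options=options))
n_eval = sum(1 for _ in tf.python_io.tf_record_iterator(
    './transform/eval-00000-of-00001.tfrecord.gz', options=options))
print(n_train, n_eval)  # should match the 11314 train / 7532 eval instances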

Create a "transformed dataset" to use in next step.


In [79]:
%%ml dataset create
name: newsgroup_transformed
train: ./transform/train-*
eval: ./transform/eval-*
format: transformed


Step 3: Training

MLWorkbench automatically builds standard TensorFlow models without you having to write any TensorFlow code.


In [80]:
# Training should use an empty output folder. So if you run training multiple times,
# use different folders or remove the output from the previous run.
!rm -fr ./train


The following training step takes about 10 to 15 minutes.


In [81]:
%%ml train
output: ./train
analysis: ./analysis/
data: newsgroup_transformed
model_args:
  model: linear_classification
  top-n: 5


TensorBoard was started successfully with pid 56037. Click here to access it.

Go to TensorBoard (link shown above) to monitor the training progress. Note that training stops when it detects that accuracy on the eval data is no longer increasing.


In [82]:
# You can also plot the summary events which will be saved with the notebook.

from google.datalab.ml import Summary

summary = Summary('./train')
summary.list_events()


Out[82]:
{u'accuracy': {'./train/train/eval'},
 u'global_step/sec': {'./train/train'},
 u'input_producer/fraction_of_32_full': {'./train/train'},
 u'loss': {'./train/train', './train/train/eval'},
 u'shuffle_batch/fraction_over_10_of_960_full': {'./train/train'}}

In [83]:
summary.plot(['loss', 'accuracy'])


Training produces two models, one in ./train/model and another in ./train/evaluation_model. These TensorFlow models are identical, except that the latter assumes the target column is part of the input and copies the target value to the output. This makes the latter ideal for evaluation.


In [84]:
!ls ./train/


evaluation_model  model  schema_without_target.json  train
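
To see the difference between the two models, you could inspect each SavedModel's serving signature (a sketch using TensorFlow's saved_model_cli tool, assuming it is installed alongside TensorFlow in this environment). The evaluation model's inputs should include the target column; the regular model's should not.

!saved_model_cli show --dir ./train/model --tag_set serve --signature_def serving_default
!saved_model_cli show --dir ./train/evaluation_model --tag_set serve --signature_def serving_default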

Step 4: Evaluation using batch prediction

Below, we use the evaluation model to run batch prediction locally. Batch prediction is needed for large datasets, where the data cannot fit in memory. For demo purposes, we will use the evaluation data again.


In [85]:
%%ml batch_predict
model: ./train/evaluation_model/
output: ./batch_predict
format: csv
data:
  csv: ./data/eval.csv


local prediction...
INFO:tensorflow:Restoring parameters from ./train/evaluation_model/variables/variables
done.

In [86]:
# Batch prediction creates a results csv file and a results schema json file.
!ls ./batch_predict


predict_results_eval.csv  predict_results_schema.json

Note that the output of prediction is a csv file containing the score for each label class. 'predicted_n' is the label with the nth largest score, and 'predicted' is the final model prediction.


In [87]:
!head -n 5 ./batch_predict/predict_results_eval.csv


rec.autos,comp.sys.mac.hardware,rec.sport.baseball,sci.space,soc.religion.christian,0.291265,0.196776,0.0854143,0.0544117,0.0462434,rec.autos
comp.graphics,comp.windows.x,sci.space,rec.motorcycles,comp.os.ms-windows.misc,0.449551,0.23406,0.0345504,0.0335983,0.0327901,comp.windows.x
rec.motorcycles,rec.sport.baseball,comp.os.ms-windows.misc,alt.atheism,comp.sys.mac.hardware,0.0703423,0.0673938,0.0624348,0.0561686,0.0550833,alt.atheism
talk.politics.mideast,talk.politics.guns,talk.politics.misc,alt.atheism,sci.crypt,0.400145,0.337041,0.133643,0.122765,0.00307648,talk.politics.mideast
alt.atheism,rec.autos,sci.space,talk.religion.misc,comp.sys.mac.hardware,0.130497,0.0720769,0.0676606,0.0658658,0.0625985,talk.religion.misc
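
Before using the %%ml evaluate helpers below, we could also compute the overall accuracy by hand (a sketch; it assumes the schema json is a list of {name, type} records, which matches how it is passed to BigQuery later in this notebook):

import json
import pandas as pd

# Read the column names from the prediction schema, then score by hand.
with open('./batch_predict/predict_results_schema.json') as f:
    column_names = [col['name'] for col in json.load(f)]

results = pd.read_csv('./batch_predict/predict_results_eval.csv', names=column_names)
print((results['predicted'] == results['target']).mean())  # overall accuracy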

In [88]:
%%ml evaluate confusion_matrix --plot
csv: ./batch_predict/predict_results_eval.csv



In [89]:
%%ml evaluate accuracy
csv: ./batch_predict/predict_results_eval.csv


Out[89]:
accuracy count target
0 0.451411 319 alt.atheism
1 0.660668 389 comp.graphics
2 0.601523 394 comp.os.ms-windows.misc
3 0.556122 392 comp.sys.ibm.pc.hardware
4 0.605195 385 comp.sys.mac.hardware
5 0.668354 395 comp.windows.x
6 0.794872 390 misc.forsale
7 0.671717 396 rec.autos
8 0.728643 398 rec.motorcycles
9 0.831234 397 rec.sport.baseball
10 0.822055 399 rec.sport.hockey
11 0.623737 396 sci.crypt
12 0.544529 393 sci.electronics
13 0.669192 396 sci.med
14 0.700508 394 sci.space
15 0.693467 398 soc.religion.christian
16 0.615385 364 talk.politics.guns
17 0.667553 376 talk.politics.mideast
18 0.403226 310 talk.politics.misc
19 0.239044 251 talk.religion.misc
20 0.639272 7532 _all

Step 5: Use BigQuery to Analyze Evaluation Results

Sometimes you want to query your prediction/evaluation results using SQL. That is easy with BigQuery.


In [90]:
# Create bucket
!gsutil mb gs://bq-mlworkbench-20news-lab
!gsutil cp -r ./batch_predict/predict_results_eval.csv gs://bq-mlworkbench-20news-lab


Creating gs://bq-mlworkbench-20news-lab/...
Copying file://./batch_predict/predict_results_eval.csv [Content-Type=text/csv]...
-
Operation completed over 1 objects/1.1 MiB.                                      

In [91]:
# Use Datalab's BigQuery API to load the CSV files into a table.

import google.datalab.bigquery as bq
import json

with open('./batch_predict/predict_results_schema.json', 'r') as f:
    schema = json.load(f)

# Create BQ Dataset
bq.Dataset('newspredict').create()

# Create the table
table = bq.Table('newspredict.result1').create(schema=schema, overwrite=True)
table.load('gs://bq-mlworkbench-20news-lab/predict_results_eval.csv', mode='overwrite',
           source_format='csv', csv_options=bq.CSVOptions(skip_leading_rows=1))


Out[91]:
Job bradley-playground/job_sGKkVQKLNfRK0j9VpFiIUO1CqcLO completed

Now we can run any SQL query against the newspredict.result1 table. Below, we query all wrong predictions.


In [92]:
%%bq query
SELECT * FROM newspredict.result1 WHERE predicted != target


Out[92]:
predicted predicted_2 predicted_3 predicted_4 predicted_5 probability probability_2 probability_3 probability_4 probability_5 target
sci.med rec.autos rec.sport.baseball sci.space comp.sys.mac.hardware 0.0944628 0.0768647 0.0702626 0.0680675 0.0661313 sci.electronics
sci.med rec.autos sci.crypt alt.atheism sci.electronics 0.118112 0.0715369 0.0698622 0.0667128 0.0594781 sci.crypt
sci.med rec.autos rec.motorcycles alt.atheism comp.os.ms-windows.misc 0.467449 0.0668502 0.0535573 0.040007 0.0382802 sci.electronics
sci.med rec.autos sci.space comp.graphics comp.sys.mac.hardware 0.098781 0.0740077 0.0660721 0.0658893 0.0632783 soc.religion.christian
sci.med rec.autos sci.space rec.motorcycles comp.graphics 0.0991149 0.0714714 0.0663251 0.0650978 0.058196 rec.sport.hockey
sci.med rec.autos rec.motorcycles talk.politics.guns sci.electronics 0.243752 0.148529 0.141205 0.0822236 0.054355 rec.autos
sci.med sci.crypt comp.graphics sci.space alt.atheism 0.235213 0.178636 0.107028 0.0589985 0.0547229 sci.crypt
sci.med sci.crypt comp.sys.ibm.pc.hardware sci.space alt.atheism 0.0889153 0.0858086 0.0776207 0.0718896 0.0637095 sci.crypt
sci.med sci.crypt talk.politics.guns sci.space rec.autos 0.586723 0.119922 0.0764783 0.0506784 0.0369884 comp.graphics
sci.med sci.crypt soc.religion.christian comp.graphics alt.atheism 0.240414 0.173254 0.125778 0.0825043 0.0749882 soc.religion.christian
sci.med sci.crypt rec.motorcycles comp.graphics sci.electronics 0.129069 0.0951029 0.0820131 0.0694286 0.0616525 sci.electronics
sci.med sci.crypt alt.atheism comp.graphics soc.religion.christian 0.176504 0.100305 0.0980852 0.0749751 0.0527355 talk.politics.misc
sci.med sci.crypt talk.politics.misc talk.politics.guns alt.atheism 0.194313 0.193072 0.124884 0.122438 0.105321 alt.atheism
sci.med sci.space rec.motorcycles rec.autos sci.electronics 0.283608 0.237266 0.0882693 0.0802769 0.0499932 sci.space
sci.med sci.space talk.politics.guns alt.atheism talk.religion.misc 0.326097 0.305573 0.116581 0.0513381 0.0367944 talk.politics.guns
sci.med sci.space rec.sport.baseball alt.atheism talk.politics.mideast 0.214708 0.113775 0.0931925 0.073653 0.0684173 rec.sport.baseball
sci.med sci.space talk.politics.guns alt.atheism talk.politics.misc 0.0855026 0.0840718 0.0736133 0.0724489 0.0668283 talk.politics.misc
sci.med sci.space talk.religion.misc alt.atheism soc.religion.christian 0.321936 0.254984 0.176534 0.165577 0.0339035 talk.religion.misc
sci.med sci.space rec.motorcycles comp.windows.x comp.sys.mac.hardware 0.108344 0.0949181 0.0798288 0.0672603 0.0631208 rec.motorcycles
sci.med sci.space rec.autos sci.electronics talk.politics.guns 0.149185 0.140614 0.135952 0.115165 0.0701548 sci.electronics
sci.med sci.space rec.sport.baseball sci.electronics rec.autos 0.098036 0.0946519 0.0838997 0.0742474 0.0720842 sci.electronics
sci.med sci.space comp.windows.x rec.sport.baseball talk.politics.mideast 0.101074 0.0978412 0.0693363 0.0674033 0.0669934 sci.space
sci.med sci.space rec.sport.baseball talk.politics.guns soc.religion.christian 0.149672 0.111454 0.089652 0.0793766 0.071511 talk.religion.misc
sci.med sci.space comp.sys.mac.hardware talk.politics.misc rec.autos 0.118442 0.0833742 0.0800071 0.0736631 0.070822 comp.windows.x
sci.med sci.space sci.crypt talk.religion.misc alt.atheism 0.147893 0.119379 0.0876462 0.064635 0.06184 talk.religion.misc

(rows: 2717, time: 1.2s, 1MB processed, job: job_ixkw-3eJ7XJRFSxhtF0Lzi8Y79rf)
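
Other queries are just as easy. For example, a sketch that counts mistakes per true label using the same table:

%%bq query
SELECT target, COUNT(*) AS errors
FROM newspredict.result1
WHERE predicted != target
GROUP BY target
ORDER BY errors DESC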

Prediction

Local Instant Prediction

The MLWorkbench also supports running prediction and displaying the results within the notebook. Note that we use the non-evaluation model below (./train/model), which takes input with no target column.


In [93]:
%%ml predict
model: ./train/model/
headers: text
data:
  - nasa
  - windows xp


predicted predicted_2 predicted_3 predicted_4 predicted_5 probability probability_2 probability_3 probability_4 probability_5 text
sci.space rec.motorcycles rec.sport.baseball comp.graphics rec.autos 0.088903 0.063468 0.061663 0.060291 0.057231 nasa
comp.os.ms-windows.misc comp.graphics misc.forsale comp.windows.x rec.motorcycles 0.145055 0.067206 0.062420 0.062265 0.056137 windows xp

Why Does My Model Predict This? Prediction Explanation

"%%ml explain" gives you insights on what are important features in the prediction data that contribute positively or negatively to certain labels. We use LIME under "%%ml explain". (LIME is an open sourced library performing feature sensitivity analysis. It is based on the work presented in this paper. LIME is included in Datalab.)

In this case, we will check which words in text are contributing most to the predicted label.


In [94]:
# Pick some data from eval csv file. They are cleaned text.
# The truth labels for the following 3 instances are
# - rec.autos
# - comp.windows.x
# - talk.politics.mideast

instance0 = ('little confused models [number] [number] heard le se someone tell differences far features ' +
            'performance curious book value [number] model less book value usually words demand ' +
            'year heard mid spring early summer best buy')
instance1 = ('hi requirement closing opening different display servers within x application manner display ' +
            'associated client proper done during transition problems')
instance2 = ('attacking drive kuwait country whose citizens close blood business ties saudi citizens thinks ' +
            'helped saudi arabia least eastern muslim country doing anything help kuwait protect saudi arabia ' +
            'indeed masses citizens demonstrating favor butcher saddam killed muslims killing relatively rich ' +
            'muslims nose west saudi arabia rolled iraqi invasion charge saudi arabia idea governments official ' +
            'religion de facto de human nature always ones rise power world country citizens leader slick ' +
            'operator sound guys angels posting edited stuff following friday york times reported group definitely ' +
            'conservative followers house rule country enough reported besides complaining government conservative ' +
            'enough asserted approx [number] [number] kingdom charge under saudi islamic law brings death penalty ' +
            'diplomatic guy bin isn called severe punishment [number] women drove public while protest ban women ' +
            'driving guy group said al said women fired jobs happen heard muslims ban women driving basis qur etc ' +
            'yet folks ban women called choose rally behind hate women allowed tv radio immoral kingdom house neither ' +
            'least nor favorite government earth restrict religious political lot among things likely replacements ' +
            'going lot worse citizens country house feeling heat lately last six months read religious police ' +
            'government western women fully stupid women imo sends wrong signals morality read cracked down few home ' +
            'based religious posted government owned newspapers offering money turns group dare worship homes secret ' +
            'place government grown try take wind conservative opposition things small taste happen guys house trying ' +
            'long run others general west evil zionists rule hate west crowd')

data = [instance0, instance1, instance2]



In [95]:
%%ml predict
model: ./train/model/
headers: text
data: $data


predicted predicted_2 predicted_3 predicted_4 predicted_5 probability probability_2 probability_3 probability_4 probability_5 text
rec.autos comp.sys.mac.hardware rec.sport.baseball sci.space soc.religion.christian 0.291265 0.196776 0.085414 0.054412 0.046243 little confused models [number] [numb...
comp.windows.x comp.graphics comp.sys.mac.hardware comp.os.ms-windows.misc sci.space 0.474711 0.072923 0.060589 0.048182 0.046209 hi requirement closing opening differ...
talk.politics.mideast talk.politics.guns talk.politics.misc alt.atheism sci.crypt 0.400145 0.337041 0.133643 0.122765 0.003076 attacking drive kuwait country whose ...

The first and second instances are predicted correctly. For the third, the top prediction "talk.politics.mideast" (the truth) only narrowly beats "talk.politics.guns" (0.40 vs 0.34). Below we run "%%ml explain" to understand more.


In [96]:
%%ml explain --detailview_only
model: ./train/model
labels: rec.autos
type: text
data: $instance0



In [97]:
%%ml explain --detailview_only
model: ./train/model
labels: comp.windows.x
type: text
data: $instance1


On instance 2, the scores of the top two labels, "talk.politics.mideast" (the truth) and "talk.politics.guns", are close. So let's analyze these two labels.


In [98]:
%%ml explain --detailview_only
model: ./train/model
labels: talk.politics.guns,talk.politics.mideast
type: text
data: $instance2
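
For reference, the LIME technique behind "%%ml explain" can also be used directly. Below is a sketch that runs LIME against a small scikit-learn stand-in classifier (our own addition for illustration; wiring LIME to ./train/model would instead require a wrapper function that returns class probabilities for a list of texts):

from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A simple stand-in classifier trained on the cleaned data from above.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(clean_train_data, news_train_data.target)

explainer = LimeTextExplainer(class_names=news_train_data.target_names)
idx = news_train_data.target_names.index('talk.politics.mideast')
exp = explainer.explain_instance(instance2, pipeline.predict_proba,
                                 labels=[idx], num_features=10)
print(exp.as_list(label=idx))  # (word, weight) pairs for the chosen label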


Deploying Model to ML Engine

Now that we have a trained model, have analyzed the results, and have tested the model locally, we are ready to deploy it to the cloud for real predictions.

Deploying a model requires the model files to be on GCS. The next few cells make a GCS bucket, copy the locally trained model to it, and deploy it.


In [99]:
!gsutil -q mb gs://bq-mlworkbench-20news-lab


ServiceException: 409 Bucket bq-mlworkbench-20news-lab already exists.

In [100]:
# Move the regular model to GCS
!gsutil -m cp -r ./train/model gs://bq-mlworkbench-20news-lab


Copying file://./train/model/assets.extra/features.json [Content-Type=application/json]...
Copying file://./train/model/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://./train/model/variables/variables.index [Content-Type=application/octet-stream]...
Copying file://./train/model/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...
Copying file://./train/model/assets.extra/schema.json [Content-Type=application/json]...
- [5/5 files][901.2 KiB/901.2 KiB] 100% Done                                    
Operation completed over 5 objects/901.2 KiB.                                    

See https://cloud.google.com/ml-engine/docs/how-tos/managing-models-jobs for the definitions of ML Engine models and versions. An ML Engine version runs predictions and is contained in an ML Engine model. We will create a new ML Engine model and deploy the TensorFlow graph as an ML Engine version. This can be done with gcloud (see https://cloud.google.com/ml-engine/docs/how-tos/deploying-models) or with Datalab, which we use below.
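
For reference, the gcloud equivalent would look roughly like this (a sketch; exact flags may vary by gcloud version, see the deploying-models guide linked above):

!gcloud ml-engine models create news
!gcloud ml-engine versions create alpha --model news --origin gs://bq-mlworkbench-20news-lab/model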


In [101]:
%%ml model deploy
path: gs://bq-mlworkbench-20news-lab
name: news.alpha


Waiting for operation "projects/bradley-playground/operations/create_news_alpha-1508447816124"
Done.

How to Build Your Own Prediction Client

A common task is to call a deployed model from different applications. Below is an example of a Python client that runs prediction.

Model permissions are outside the scope of this notebook; for more information, see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials .


In [102]:
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors

# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')

# Get application default credentials (possible only if the gcloud tool is
#  configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
#  for more info.
credentials = GoogleCredentials.get_application_default()

# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)

# Create a dictionary containing data to predict.
# Note that the data is a list of csv strings.
body = {
    'instances': ['nasa',
                  'windows ex']}

# Create a request
request = ml.projects().predict(
    name=api_path,
    body=body)

print('The JSON request: \n')
print(request.to_json())

# Make the call.
try:
    response = request.execute()
    print('\nThe response:\n')
    print(json.dumps(response, indent=2))
except errors.HttpError, err:
    # Something went wrong, print out some information.
    print('There was an error. Check the details:')
    print(err._get_reason())


The JSON request: 

{"body": "{\"instances\": [\"nasa\", \"windows ex\"]}", "resumable_uri": null, "headers": {"content-type": "application/json", "accept-encoding": "gzip, deflate", "accept": "application/json", "user-agent": "google-api-python-client/1.6.2 (gzip)"}, "uri": "https://ml.googleapis.com/v1/projects/bradley-playground/models/news/versions/alpha:predict?alt=json", "resumable": null, "methodId": "ml.projects.predict", "body_size": 37, "resumable_progress": 0, "method": "POST", "_in_error_state": false, "response_callbacks": []}

The response:

{
  "predictions": [
    {
      "probability": 0.08890275657176971, 
      "probability_5": 0.05723080784082413, 
      "probability_4": 0.06029100343585014, 
      "predicted": "sci.space", 
      "probability_3": 0.06166268140077591, 
      "probability_2": 0.06346812099218369, 
      "predicted_2": "rec.motorcycles", 
      "predicted_3": "rec.sport.baseball", 
      "predicted_4": "comp.graphics", 
      "predicted_5": "rec.autos"
    }, 
    {
      "probability": 0.1440439522266388, 
      "probability_5": 0.060565147548913956, 
      "probability_4": 0.06096724420785904, 
      "predicted": "comp.os.ms-windows.misc", 
      "probability_3": 0.06115517392754555, 
      "probability_2": 0.06615085899829865, 
      "predicted_2": "comp.graphics", 
      "predicted_3": "misc.forsale", 
      "predicted_4": "comp.windows.x", 
      "predicted_5": "rec.motorcycles"
    }
  ]
}

To explore the prediction service further, check out the API Explorer (https://developers.google.com/apis-explorer). It lets you send raw HTTP requests to many Google APIs, which is useful for understanding the requests and responses, and it can help you build your own client in your favorite language.

Visit https://developers.google.com/apis-explorer/#search/ml%20engine/ml/v1/ml.projects.predict and enter the following values into each text box.


In [103]:
# The output of this cell is placed in the name box
# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='news',
    version_name='alpha')
print('Place the following in the name box')
print(api_path)


Place the following in the name box
projects/bradley-playground/models/news/versions/alpha

The fields text box can be empty.

Note that because we deployed the non-evaluation model, the deployed model takes csv input with only one column. In general, "instances" is a list of csv strings for models trained by MLWorkbench.

Click in the request body box; a small drop-down menu appears at the far right of the input box. Select "Freeform editor", then enter the following in the request body box.


In [104]:
print('Place the following in the request body box')
request = {'instances': ['nasa', 'windows xp']}
print(json.dumps(request))


Place the following in the request body box
{"instances": ["nasa", "windows xp"]}

Then click the "Authorize and execute" button. The prediction results are returned in the browser.

Cleaning up the deployed model


In [105]:
%%ml model delete
name: news.alpha


Waiting for operation "projects/bradley-playground/operations/delete_news_alpha-1508447899648"
Done.

In [ ]:
%%ml model delete
name: news

In [107]:
# Delete the GCS bucket
!gsutil -m rm -r gs://bq-mlworkbench-20news-lab


Removing gs://bq-mlworkbench-20news-lab/predict_results_eval.csv#1508447773068028...
Removing gs://bq-mlworkbench-20news-lab/model/assets.extra/features.json#1508447803765889...
Removing gs://bq-mlworkbench-20news-lab/model/assets.extra/schema.json#1508447803782795...
Removing gs://bq-mlworkbench-20news-lab/model/saved_model.pb#1508447803954460...
Removing gs://bq-mlworkbench-20news-lab/model/variables/variables.data-00000-of-00001#1508447804049817...
Removing gs://bq-mlworkbench-20news-lab/model/variables/variables.index#1508447803769614...
/ [6/6 objects] 100% Done                                                       
Operation completed over 6 objects.                                              
Removing gs://bq-mlworkbench-20news-lab/...

In [108]:
# Delete BQ table

bq.Dataset('newspredict').delete(delete_contents=True)